Building intuition into popgen concepts and simulation-based inference

Workshop on population and speciation genomics

Authors

Martin Petr

CC BY 4.0

Published

January 2025

First things first

These slides and other resources are (and always will be) at:

github.com/bodkan/cesky-krumlov-2025



Open this link now so you have everything at hand later.

Many problems in population genetics cannot be solved by a mathematician, no matter how gifted. [It] is already clear that computer methods are very powerful. This is good. It […] permits people with limited mathematical knowledge to work on important problems […].

Why use simulations?

  1. Making sense of inferred statistics
  2. Fitting model parameters (i.e. ABC)
  3. Ground truth for method development

Making sense of inferred statistics

Image from Peter (2016)

Making sense of inferred statistics

Fitting model parameters (i.e. ABC)

Image from Wikipedia on ABC

Ground truth for method development

Simulation software

The most famous and widely used are SLiM and msprime.

They are very powerful but both require:

  • quite a bit of programming knowledge,
  • a lot of code for non-trivial simulations (🐛🪲🐜).

The exercises will focus on the slendr
popgen simulation toolkit for R.


But let’s look at SLiM and msprime at least a little bit.

What is SLiM?

  • A forward-time simulator
  • It’s fully programmable!
  • Massive library of functions for:
    • demographic events
    • various mating systems
    • selection, quantitative traits, …

Modified from Alexei Drummond

Simple neutral simulation in SLiM

initialize() {
    // create a neutral mutation type
    initializeMutationType("m1", 0.5, "f", 0.0);
    // initialize 1Mb segment
    initializeGenomicElementType("g1", m1, 1.0);
    initializeGenomicElement(g1, 0, 999999);
    // set mutation rate and recombination rate of the segment
    initializeMutationRate(1e-8);
    initializeRecombinationRate(1e-8);
}

// create an ancestral population p1 of 10000 diploid individuals
1 early() { sim.addSubpop("p1", 10000); }

// in generation 1000, create two daughter populations p2 and p3
1000 early() {
    sim.addSubpopSplit("p2", 5000, p1);
    sim.addSubpopSplit("p3", 1000, p1);
}

// in generation 10000, stop the simulation and save 100 individuals
// from p2 and p3 to a VCF file
10000 late() {
    p2_subset = sample(p2.individuals, 100);
    p3_subset = sample(p3.individuals, 100);
    c(p2_subset, p3_subset).genomes.outputVCF("/tmp/slim_output.vcf.gz");
    sim.simulationFinished();
    catn("DONE!");
}

What is msprime?

  • A Python module for writing coalescent simulations
  • Extremely fast (genome-scale, population-scale data)
  • You must know Python fairly well to build complex models

Modified from Alexei Drummond

Simple simulation using msprime

import msprime

demography = msprime.Demography()
demography.add_population(name="A", initial_size=10_000)
demography.add_population(name="B", initial_size=5_000)
demography.add_population(name="C", initial_size=1_000)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="C")

ts = msprime.sim_ancestry(
  sequence_length=10e6,
  recombination_rate=1e-8,
  samples={"A": 100, "B": 100},
  demography=demography
)

www.slendr.net

Why a new package?

Spatial simulations!

Why a new package?

  • Most researchers are not expert programmers

  • All but the most trivial simulations require lots of code

  • 90% [citation needed] of simulations are basically the same!

    • create populations (splits and \(N_e\) changes)

    • specify if/how they should mix (rates and times)

  • Lot of code duplication across projects
slendr makes “standard” demographic simulations trivial
and unlocks new kinds of spatial simulations

Let’s get started

Everything we do will be in R



You can write your slendr code however you like, but I recommed using RStudio, for convenience.


Always start your R scripts with this:

library(slendr)
init_env()
The interface to all required Python modules has been activated.

(If you get a note about missing SLiM, safely ignore it for now.)

Typical steps of a slendr R workflow


  1. creating populations
  2. scheduling population splits
  3. programming \(N_e\) size changes
  4. encoding gene-flow events
  5. simulation sequence of a given size
  6. computing statistics from simulated outputs

Creating a population()

Each needs a name, size and time of appearance:

pop1 <- population("pop1", N = 1000, time = 1)

. . .


This creates a normal R object. Typing it out gives a summary:

pop1
slendr 'population' object 
-------------------------- 
name: pop1 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 1: created as an ancestral population (N = 1000)

Programming population splits

Splits are indicated by the parent = <pop> argument:

pop2 <- population("pop2", N = 100, time = 50, parent = pop1)

. . .


The split is reported in the “historical summary”:

pop2
slendr 'population' object 
-------------------------- 
name: pop2 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 50: split from pop1 (N = 100)

Scheduling resize events – resize()

Step size decrease:

pop1 <- population("pop1", N = 1000, time = 1)
pop1_step <- resize(pop1, N = 100, time = 500, how = "step")


Exponential increase:

pop2 <- population("pop2", N = 100, time = 50, parent = pop1)
pop2_exp <- resize(pop2, N = 10000, time = 500, end = 2000, how = "exponential")

Tidyverse-style pipe %>% interface

A more concise way to express the same thing as before.


Step size decrease:

pop1 <-
  population("pop1", N = 1000, time = 1) %>%
  resize(N = 100, time = 500, how = "step")


Exponential increase:

pop2 <-
  population("pop2", N = 1000, time = 1) %>%
  resize(N = 10000, time = 500, end = 2000, how = "exponential")

A more complex model

pop1 <- population("pop1", N = 1000, time = 1)

pop2 <-
  population("pop2", N = 1000, time = 300, parent = pop1) %>%
  resize(N = 100, how = "step", time = 1000)

pop3 <-
  population("pop3", N = 1000, time = 400, parent = pop2) %>%
  resize(N = 2500, how = "step", time = 800)

pop4 <-
  population("pop4", N = 1500, time = 500, parent = pop3) %>%
  resize(N = 700, how = "exponential", time = 1200, end = 2000)

pop5 <-
  population("pop5", N = 100, time = 600, parent = pop4) %>%
  resize(N = 50, how = "step", time = 900) %>%
  resize(N = 1000, how = "exponential", time = 1600, end = 2200)

Each object carries its history!

pop5
slendr 'population' object 
-------------------------- 
name: pop5 
non-spatial population
stays until the end of the simulation

population history overview:
  - time 600: split from pop4 (N = 100)
  - time 900: resize from 100 to 50 individuals
  - time 1600-2200: exponential resize from 50 to 1000 individuals

Gene flow / admixture

We can schedule gene_flow() from pop1 into pop2 with:

gf <- gene_flow(from = pop1, to = pop2, start = 2000, end = 2200, rate = 0.13)

. . .

Here rate = 0.13 means 13% migrants over the given time window will come from “pop1” into “pop2”.

. . .


Multiple gene-flow events can be gathered in a list:

gf <- list(
  gene_flow(from = pop1, to = pop2, start = 500, end = 600, rate = 0.13),
  gene_flow(from = ..., to = ..., start = ..., end = ..., rate = ...),
  ...
)

Last step before simulation: compile_model()


model <- compile_model(
  list(pop1, pop2, pop3, pop4, pop5),
  generation_time = 1,
  simulation_length = 3000,
  direction = "forward"
)


Compilation takes a list of model components, performs internal consistency checks, returns a single model object.

Last step before simulation: compile_model()


model <- compile_model(
  list(pop1, pop2, pop3, pop4, pop5),
  gene_flow = gf,      # <----- in case our model includes gene flow
  generation_time = 1,
  simulation_length = 3000,
  direction = "forward"
)


Gene flow(s) that we programmed are included
via the gene_flow argument.

Model summary

Typing the compiled model into R prints a brief summary:

model
slendr 'model' object 
--------------------- 
populations: pop1, pop2, pop3, pop4, pop5 
geneflow events: 1 
generation time: 1 
time direction: forward 
time units: generations 
total running length: 3000 time units
model type: non-spatial

configuration files in: /private/var/folders/70/b_q2zdh116b9pfg29p03sx600000gn/T/Rtmpb37jsY/file263c53d8925_slendr_model 

Model visualization

plot_model(model, proportions = TRUE)

So we build a model

How do we simulate data from it?

Built-in “simulation engines”

slendr has two built-in simulation “engine scripts”:

They are designed to understand slendr models.

. . .


All you need to simulate data is this one line of code:

ts <- msprime(model, sequence_length = 100e6, recombination_rate = 1e-8)


You don’t have to write msprime or SLiM code!

The result of a slendr simulation is a tree sequence (ts)

What is tree sequence?

  • a record of full genetic ancestry of a set of samples
  • an encoding of DNA sequence carried by those samples
  • an efficient analysis framework

Why tree sequence?


Why not VCF or a normal genotype table?

What we usually have

What we usually want

An understanding of our samples’ evolutionary history:

This is exactly what a tree sequence is!

The magic of tree sequences

They allow computing of popgen statistics without genotypes!

There is a “duality” between mutations and branch lengths.

See an amazing paper by Ralph et al. (2020) for more detail.

What if we need mutations though?

Coalescent and mutation processes can be decoupled!

What if we need mutations though?

Coalescent and mutation processes can be decoupled!

With slendr, we can add mutations after the simulation using ts_mutate().

Let’s take the model from earlier…

… and simulate data from it


In our script we’ll have something like this:

<... population() definitions ...>

model <- compile_model(...)
  
ts <-
  msprime(model, sequence_length = 50e6, recombination_rate = 1e-8)

… and simulate data from it


In our script we’ll have something like this:

<... population() definitions ...>

model <- compile_model(...)
  
ts <-
  msprime(model, sequence_length = 50e6, recombination_rate = 1e-8) %>%
  ts_mutate(mutation_rate = 1e-8)


Always use ts_mutate() during exercises! Otherwise you’ll get weird results due to the lack of mutations on the tree sequence.

So we can simulate data

How do we work with this ts thing?

Standard genotype formats

If a tree sequence doesn’t cut it, you can always:

  • export genotypes to a VCF file:
ts_vcf(ts, path = "path/to/a/file.vcf.gz")
  • export genotypes in an EIGENSTRAT format:
ts_eigenstrat(ts, prefix = "path/to/eigenstrat/prefix")
  • access genotypes in a data frame:
ts_genotypes(ts)
  pos pop1_1_chr1 pop1_1_chr2 pop1_2_chr1 pop1_2_chr2 pop1_3_chr1 pop1_3_chr2
1 123           0           0           0           0           0           0
2 569           0           0           0           0           0           0

slendr’s R interface to tskit statistics

Allele-frequecy spectrum, diversity \(\pi\), \(F_{ST}\), Tajima’s D, etc.

Find help at slendr.net/reference or in R under ?ts_fst etc.

Specifying individuals to compute on

We can get samples recorded in ts with ts_samples():

ts_samples(ts) %>% head(2)
# A tibble: 2 × 3
  name    time pop  
  <chr>  <dbl> <chr>
1 pop1_1  3001 pop1 
2 pop1_2  3001 pop1 

. . .

A shortcut ts_names() can also be useful:

ts_names(ts) %>% head(5)
[1] "pop1_1" "pop1_2" "pop1_3" "pop1_4" "pop1_5"

. . .

We can also get a per-population list of individuals:

ts_names(ts, split = "pop")
$pop1
[1] "pop1_996" "pop1_788" "pop1_458" "pop1_222" "pop1_229"

All slendr statistics accept this as their second (and further) arguments.


This is modelled after the sample_sets= argument of the respective tskit Python methods (except that you use “symbolic names” of individuals, not tree-sequence node numbers).

Example: nucleotide diversity

Get list of individuals in each population:

samples <- ts_names(ts, split = "pop")

names(samples)
[1] "pop1" "pop2"


samples$pop1 %>% head(3)
[1] "pop1_1" "pop1_2" "pop1_3"
samples$pop2 %>% head(3)
[1] "pop2_1" "pop2_2" "pop2_3"

 

Compute nucleotide diversity:

ts_diversity(ts, sample_sets = samples)
# A tibble: 2 × 2
  set   diversity
  <chr>     <dbl>
1 pop1  0.000405 
2 pop2  0.0000811

Example: allele frequency spectrum

Get names of all individuals:

samples <- ts_names(ts)[1:5]
samples
[1] "pop_1" "pop_2" "pop_3" "pop_4" "pop_5"


Compute the AFS:

afs <- ts_afs(ts, sample_sets = list(samples))

afs[-1]
 [1] 3917 2151 1432  941  740  624  607  587  416  385

 

plot(afs[-1], type = "b",
     xlab = "allele count bin",
     ylab = "frequency")

Working with “multi-population” statistics

For multipopulation statistics, you need a (named) list of samples recorded in each population.

ts_names() has a split = "pop" option just for this:

samples <- ts_names(ts, split = "pop")
samples
$<NA>
NULL

$<NA>
NULL

You can use this in place of sample_names in code like:

ts_diversity(ts, sample_sets = samples)

More information


  • slendr paper is now in PCI EvolBiol

  • documentation, tutorials is here

  • GitHub repo (bug reports!) is here


  • check out my new demografr inference package